Online Row Sampling
Finding a small spectral approximation for a tall n × d matrix A is a fundamental numerical primitive. For a number of reasons, one often seeks an approximation whose rows are sampled from those of A. Row sampling improves interpretability, saves space when A is sparse, and preserves structure, which is important, e.g., when A represents a graph.
However, correctly sampling rows from A can be costly when the matrix is large and cannot be stored and processed in memory. Hence, a number of recent publications focus on row sampling in the streaming setting, using little more space than what is required to store the returned approximation (Kelner--Levin, Theory Comput. Sys. 2013, Kapralov et al., SIAM J. Comp. 2017).
Inspired by a growing body of work on online algorithms for machine learning and data analysis, we extend this work to a more restrictive online setting: we read rows of A one by one and immediately decide whether each row should be kept in the spectral approximation or discarded, without ever retracting these decisions. We present an extremely simple algorithm that approximates A up to multiplicative error (1+ϵ) and additive error δ using O(d log d log(ϵ‖A‖₂²/δ)/ϵ²) online samples, with memory overhead proportional to the cost of storing the spectral approximation. We also present an algorithm that uses O(d²) memory but only requires O(d log(ϵ‖A‖₂²/δ)/ϵ²) samples, which we show is optimal.
Our methods are clean and intuitive, allow for lower memory usage than prior work, and expose new theoretical properties of leverage-score-based matrix approximation.
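The online setting described above can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: the oversampling constant `c`, the use of the ridge parameter δ, and estimating each row's leverage score against the sample kept so far are assumptions made for the sketch.

```python
import numpy as np

def online_row_sample(A, eps=0.5, delta=1e-3, seed=0):
    """Stream the rows of A once; keep each row with probability
    proportional to its (ridge) leverage score, estimated from the
    rows sampled so far.  Returns B with B^T B ~ A^T A."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    c = 8 * np.log(d) / eps**2          # oversampling constant (assumed)
    M = delta * np.eye(d)               # running estimate of B^T B + delta*I
    kept = []
    for a in A:                          # one pass, row by row
        # approximate online ridge leverage score of this row
        tau = a @ np.linalg.solve(M, a)
        p = min(1.0, c * tau)
        if rng.random() < p:             # keep, rescaled to stay unbiased
            row = a / np.sqrt(p)
            kept.append(row)
            M += np.outer(row, row)      # the decision is never retracted
    return np.vstack(kept)
```

Note that each row is inspected exactly once and a kept row is never discarded later, matching the online restriction; the memory used is dominated by the d × d matrix M and the stored sample itself.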
A Note on Efficient Computation of All Abelian Periods in a String
We derive a simple efficient algorithm for computing all Abelian periods of a string, given all Abelian squares in the string. An efficient algorithm for the latter problem was given by Cummings and Smyth in 1997. Along the way, we show an alternative algorithm for Abelian squares. We also obtain a linear-time algorithm finding all `long' Abelian periods. The aim of the paper is a (new) reduction of the problem of computing all Abelian periods to the (already solved) problem of computing all Abelian squares, which provides new insight into both connected problems.
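For reference, an Abelian square is a substring uv with |u| = |v| and v a permutation (anagram) of u. The sketch below is a quadratic brute-force enumerator using prefix Parikh (letter-count) vectors; it is for illustration only and is not the Cummings--Smyth algorithm or the paper's reduction.

```python
def abelian_squares(s):
    """Return all occurrences (start, length) of Abelian squares in s,
    i.e. substrings uv with |u| == |v| and v an anagram of u.
    Quadratic time via prefix Parikh (letter-count) vectors."""
    n = len(s)
    alphabet = sorted(set(s))
    index = {ch: k for k, ch in enumerate(alphabet)}
    # prefix[i][k] = number of occurrences of letter k in s[:i]
    prefix = [[0] * len(alphabet)]
    for ch in s:
        row = prefix[-1][:]
        row[index[ch]] += 1
        prefix.append(row)

    def parikh(i, j):                    # letter counts of s[i:j]
        return [b - a for a, b in zip(prefix[i], prefix[j])]

    out = []
    for half in range(1, n // 2 + 1):
        for i in range(n - 2 * half + 1):
            if parikh(i, i + half) == parikh(i + half, i + 2 * half):
                out.append((i, 2 * half))
    return out
```

For example, "aabb" contains the Abelian squares "aa" and "bb", while "abab" contains only the full string itself ("ab" followed by its anagram "ab").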
Optimal lower bounds for universal relation, and for samplers and finding duplicates in streams
In the communication problem UR (universal relation) [KRW95], Alice and Bob respectively receive x, y ∈ {0,1}ⁿ with the promise that x ≠ y. The last player to receive a message must output an index i such that x_i ≠ y_i. We prove that the randomized one-way communication complexity of this problem in the public coin model is exactly Θ(min{n, log(1/δ) log²(n/log(1/δ))}) for failure probability δ. Our lower bound holds even if promised supp(y) ⊆ supp(x). As a corollary, we obtain optimal lower bounds for ℓ_p-sampling in strict turnstile streams for 0 ≤ p < 2, as well as for the problem of finding duplicates in a stream. Our lower bounds do not need to use large weights, and hold even if promised x ∈ {0,1}ⁿ at all points in the stream.
We give two different proofs of our main result. The first proof demonstrates that any algorithm solving sampling problems in turnstile streams in low memory can be used to encode subsets of [n] of certain sizes into a number of bits below the information-theoretic minimum. Our encoder makes adaptive queries to the algorithm throughout its execution, but done carefully so as to not violate correctness. This is accomplished by injecting random noise into the encoder's interactions with the algorithm, which is loosely motivated by techniques in differential privacy. Our second proof is via a novel randomized reduction from Augmented Indexing [MNSW98] which needs to interact with the algorithm adaptively. To handle the adaptivity we identify certain likely interaction patterns and union bound over them to guarantee correct interaction on all of them. To guarantee correctness, it is important that the interaction hides some of its randomness from the algorithm in the reduction.
Comment: merge of arXiv:1703.08139 and of work of Kapralov, Woodruff, and Yahyazadeh.
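To make the UR problem concrete, here is the trivial one-way protocol in which Alice simply sends her whole string; it costs n bits, which is one reference point for the min in the lower bound. The function names are hypothetical and the sketch is purely illustrative.

```python
def alice_message(x):
    """Trivial one-way protocol for UR: Alice sends her whole string.
    This costs n bits; the paper's lower bound shows that for failure
    probability delta, no randomized protocol can do much better than
    the smaller of n and a polylog-type bound in n and 1/delta."""
    return x                                     # n bits on the wire

def bob_output(message, y):
    """Bob outputs any index i with x_i != y_i (promise: x != y)."""
    return next(i for i, (a, b) in enumerate(zip(message, y)) if a != b)
```

For instance, with x = 1011 and y = 1001 the promise x ≠ y holds and Bob outputs index 2, a coordinate where the strings differ.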